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Abstract 



Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a 
problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play 
a key role in many widely techniques, in particular in Principal Component Analysis (PCA). In many 
modern data analysis problems, statisticians are faced with large datasets where the sample size, n, is 
of the same order of magnitude as the number of variables p. Random matrix theory predicts that in 
| this context, the eigenvalues of the sample covariance matrix are not good estimators of the eigenvalues 

of the population covariance. 

We propose to use a fundamental result in random matrix theory, the Marcenko-Pastur equation, to 
better estimate the eigenvalues of large dimensional covariance matrices. The Marcenko-Pastur equation 
holds in very wide generality and under weak assumptions. The estimator we obtain can be thought of 
as "shrinking" in a non linear fashion the eigenvalues of the sample covariance matrix to estimate the 
population eigenvalue. Inspired by ideas of random matrix theory, we also suggest a change of point of 
view when thinking about estimation of high-dimensional vectors: we do not try to estimate directly 
the vectors but rather a probability measure that describes them. We think this is a theoretically more 
fruitful way to think about these problems. 

Our estimator gives fast and good or very good results in extended simulations. Our algorithmic 
approach is based on convex optimization. We also show that the proposed estimator is consistent. 

c3 ■ 1 Introduction 



With data acquisition and storage now easy, today's statisticians often encounter datasets for which 
the sample size, n and the number of variables p, are both large: in the order of hundreds, thousands, 
millions, or even billions in situations such as web search problems. 

The analysis of these datasets using classical methods of multivariate statistical analysis requires some 
care. While the ideas are still relevant, the intuition for the estimators that are used and the interpretation 
of the results are often - implicitly - justified by assuming an asymptotic framework of p fixed and n 
growing infinitely large. This assumption was consistent with the practice of statistics when these ideas 
were developed, since investigation of datasets with a large number of variables was very difficult. A better 
theoretical framework for modern - i.e large p - datasets, however is the assumption of the so-called "large 
n, large p" asymptotics. In other words, one should consider that both n and p go to infinity, perhaps 
with the restriction that their ratio goes to a finite limit 7, and draw practical insights from the theoretical 
results obtained in this setting. 
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We will turn our attention to an object of central interest in multivariate statistics: the eigenvalues 
of covariance matrices. A key application is Principal Components Analysis (PCA), where one searches 
for a good low dimensional approximation to the data by projecting the data on the "best" possible k 
dimensional subspace: here "best" means that the projected data explain as much variance in the original 
data as possible. This amount of variance explained is measured by the eigenvalues of the population 
covariance matrix, S p , and hence we need to find a way to estimate those eigenvalues. We will discuss in 
the course of the paper other problems where the eigenvalues of S p play a key role. 

We take a moment here to give a few examples that illustrate the differences that occur under the 
different asymptotic settings. To pose the problem more formally, let us say that we observe iid random 
vectors X%, . . . , X n in W, and that the covariance of AQ is E p . We call X the data matrix who se rows are 
the X jS. In the classical context, where p is fixed and n goes to oo, a fundamental result of ((Anderson! 
1963) says that the eigenvalues of the sample covariance matrix S p = (X — X)'(X — X)/(n — 1) are good 
estimators of the population eigenvalues (i.e the eigenvalues of T, p ). More precisely, calling ij t he ordered 
eigen values of S p {l\ > h ■ ■ ■) and Aj the ordered eigenvalues of S p (Ai > A2 • • •), it was shown in (|Anderson , 



1963) that 



Vn(k - Aj) =>■ AA(0, 2\f) , 



when the Aj are normally distributed and all the Aj's are distinct. This result provided rigorous grounds 
for estimating the eigenvalues of the population covariance matrix, S p , with the eigenvalues of the sample 
covariance ma trix, S v , when p is small compared to n. (For more details on Anderson's theorem, we refer 
the reader to (jAndersonl l2003h Theorem 13.5.1.) 

Shifting assumptions to "large n, large p" asymptotics induces fundamental differences in the behavior 
of multivariate statistics, some of which we will highlight in the course of the paper. As a first example, let 
us consider the c ase where T, p = ld p , so all the population eigenvalues are equal t o 1. A result firs t shown 
m under some moment growth assumptions, and later refined in (jYin et all Il988l ). states 

that if the entries of the AVs are i.i.d and have a fourth moment, and \ipjn — > 7, then 

h -» (1 + Vl? a - s - 

In particular, l\ is not a consiste nt estimator of A1 . No te that by picking n = p, 1% tends to 4 whereas 
Ai = 1. (For more general £ p , see (|E1 KarouiL l iTo Add earl ) Section 4.3 for numerically explicit results about 
the limit of 

As the case of S p = Id p illustrated, when n and p are both large, the largest sample eigenvalue is 
biased, sometime dramatically so. Hence, we should correct this bias in the largest sample eigenvalue(s) 
if we want to use them in data analysis. Theoretical results predict that the behavior of extreme sample 
eigenvalues can be quite subtle; in particular, depending on how far an isolated population eigenvalue is 
from the bulk of the population spectrum, the corresponding sample eigenvalue can either be isolated, and 
far away from the b ulk of the sample eigenv al ues, or be absorbed by the b ul k of the sample eig envalues (see 



tar away tro m the b ulk ot the sample eigenv al ues, or be absorbed by trie b ul k ot th e sample eigenvalues (see 
( Raik et allbood ). (|E1 KaronilfTo Appear!) . (|Raik and Silversteinl Eool . (|Panll . lTo Appear!) ). One thing 
is however clear from the most recent theoretical results : if we wish to de-bias extreme sample eigenvalues, 
we need an accurate estimate of the so-called population s pectral dist ribution, a probability measure that 
characterizes the population eigenvalues (see ( El KarouiL ITo Appear! )) . This is what our algorithm will 
deliver. 

We have so far mostly discussed extreme sample eigenvalues. However, much is also known about the 
behavior of the whole vector of sample eigenvalues I2, ■ ■ ■ , l p ) and its asymptotic behavior. In particular, 
theory predicts tha t in the "large n, la rge p" case, the scree plot (i.e the plot of the sample eigenvalues 
. their rank; see (jMardia et all Il97flh ) becomes uninformative and deceptive. What we propose in this 



vs 

paper is to use random matrix theory to develop practically useful tools to remedy the flaws appearing in 
some widely used tools in multivariate statistics. 

Before we discuss how we will go about it, let us briefly discuss some issues that arise when estimating 
vectors of large dimension, since working in an asymptotic setting where p — > 00 is not without additional 
difficulties. Since we will try to estimate vectors of increasingly larger and larger size, an appropriate 
notion of convergence is needed if we want to quantify the quality of our estimators. Standard norms in 
high-dimensions not necessarily a very good choice: for instance, if we are in M 100 , and make an error of 
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size 1/100 in all coordinates, the resulting l\ error is 1, even though, at least intuitively, it would seem like 
we are doing well. Also, if we made a large error (say size 1) in one direction, the I2 norm would be large 
(larger than 1 at least), even though we may have gotten the structural information about this vector (and 
almost all its coordinates) "right". Inspired by ideas of random matrix theory, we propose to associate 
to high-dimensional vectors probability measures that describe them. We will explain this in more detail 
in Section 12.11 After this change of point of view, our focus becomes trying to estimate these measures. 
Why choosing to estimate measures? The reasons are many. Chief among them is that this approach will 
allow us to look into the structure of the population eigenvalues. For instance, we would like to be able 
to say whether all population eigenvalues are equal, or whether they are clustered around say two values, 
or if they are uniformly spread out on an interval. Because the ratio p/n can make the scree plot appear 
smooth (and hence in some sense uninformative) regardless of the true population eigenvalue structure, 
this structural information is not well estimated by currently existing methods. We discuss other practical 
benefits (like scalability with p) of the measure estimation approach in 13. 3. 71 In the context of PC A, where 
usually the concern is not to estimate each population eigenvalues with very high precision, but rather to 
have an idea of the structure of the population spectrum to guide the choice of lower-dimensional subspaces 
on which to project the data, this measure approach is particularly appealing. Examples to come later in 
the paper will illustrate this point. 

Random matrix theory plays a key role in our approach to this measure estimation problem. A main 
ingredient of our method is a fundamental result, which we call the Marcenko-Pastur equation (see Theorem 
Ul, which relates the asymptotic behavior of the sample eigenvalues to the population eigenvalues. The 
assumptions under which the theorem holds are very weak (a fourth moment condition) and hence it is very 
widely applicable. Until now, this theorem has not been used to do inference on population eigenvalues. 
Partly this is because in its general form it has not received much attention in statistics, and partly because 
the inverse problem that needs to be considered is very hard to solve if it is not posed the right way. We 
propose an original way to approach inverting the Marcenko-Pastur equation. In particular, we will be 
able to estimate given the eigenvalues of the sample covariance matrix S p the probability measure, H p , 
that describes the population eigenvalues. We use the standard names empirical spectral distribution for 
F p and population spectral distribution for H p . It is important to state clearly what asymptotic framework 
we place ourselves in. We will consider that when p and n go to infinity, H p stays fixed. In particular, it 
has a limit, denoted floo- We call this framework "asymptotics at fixed spectral distribution". Of course, 
fixing Hp does not imply that we fix p. For instance, sometime we will have H p = 61, for all p. Since the 
parameter of interest in our problems is really the measure H p , the fixed spectral distribution asymptotics 
corresponds to classical assumptions for parameter estimation in statistics, where the parameter does not 
change with the number of variables observed. We refer the reader to 13.3.61 for a more detailed discussion. 

To solve the inverse problem posed by the Marcenko-Pastur equation, we propose to discretize the 
Marcenko-Pastur equation and then use convex optimization methods to solve the discretized version of 
the problem. In doing so, we obtain a fast and provably accurate algorithm to estimate the population 
parameter of interest, H p , from the sample eigenvalues. The approach is non-parametric since no assump- 
tions are made a priori on the structure of the population eigenvalues. One outcome of the algorithm is an 
efficient graphical method to look at the structure of the population eigenvalues. Another outcome is that 
since we have an estimate of the measure that describes the population eigenvalues, standard statistical 
ideas then allow us to get estimates of the individual population eigenvalues A,. Some subtle problems 
may arise when doing so and we address them in 13.3.61 The final result of the algorithm can be thought 
of as performing non-linear shrinkage of the sample eigenvalues to estimate the population eigenvalues. 

We want to highlight two contributions of our paper. First, we propose to estimate measures associated 
with high-dimensional vectors rather than estimating the vectors. This gives rise to natural notions of 
consistency and accuracy of our estimates which are reasonable theoretical requirements for any estimator 
to achieve. And second, we make use, for the first time, of a fundamental result of random matrix theory 
to solve an important practical problem in multivariate statistics. 

The rest of the paper is divided into four parts. In Section we give some background on results in 
Random Matrix Theory that will be needed. We do not assume that the reader has any familiarity with 
the topic. In Section 01 we present our algorithm to estimate H p , the population spectral distribution, 
and also the population eigenvalues. In Section |1J we present the results of some simulations. We give in 
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Section |S] a proof of consistency of our algorithm. The Appendix contains some details on implementation 
of the algorithm. 

A note on notation is needed before we start: in the rest of the paper, p will always be a function of n, 
with the property that p(n)/n — ► 7 and 7 S (0, 00). To avoid cumbersome notations, we will usually write 
p and not p(n). 

2 Background: Random matrix theory of sample covariance matrices 

There is a large body of work concerned with the limiting behavior of the eigenvalues of a sample 
covariance matrix when p and n both go to 00; it constitutes an important subset of what is commonly 
known as Random Matrix Theory, to which we now turn. This is a wide area of research, of which we 
will only give a very quick and self-contained overview. Our eventual aim in this section is to introduce a 
fundamental result, the Marcenko-Pastur equation, that relates the asymptotic behavior of the eigenvalues 
of the sample covariance matrix to that of the population covariance in the "large n, large p" asymptotic 
setting. The formulation of the result requires that we introduce some concepts and notations. 

2.1 Changing point of views: from vectors to measures 

One of the first problems to tackle is to find a mathematically efficient way to express the limit of a 
vector whose size grows to 00. (Recall that there are p eigenvalues to estimate in our problem and p goes 
to 00.) A fairly natural way to do so is to associate to any vector a probability measure. More explicitly, 
suppose we have a vector (y%, . . . , y p ) in W. We can associate to it the following measure: 

1 v 

dGp(x) = -J2S yi (x) . 
P i=l 

G p is thus a measure with p point masses of equal weight, one at each of the coordinates of the vector. 

In the rest of the paper, we will denote by H p the spectral distribution of the population covariance 
matrix E p , i.e the measure associated with the vector of eigenvalues of T, p . We will refer to H p as the 
population spectral distribution. We can write this measure as 

1 P 

dH p (x) = - ^^xM) > 
P i=l 

where 5\ t is a point mass, of mass 1, at Aj. We also call 6\ t a "dirac" at Aj. The simplest example of 
population spectral distribution is found when E p = Id p . In this case, for all i, Aj = 1, and dH p = Si. So 
the population spectral distribution is a point mass at 1 when S p = Id p . 

Similarly, we will denote by F p the measure associated with the eigenvalues of the sample covariance 
matrix S p . We refer to F p as the empirical spectral distribution. Equivalently, we define 

dF p {x) = - Y.S h {x) ■ 

P i=l 

The change of focus from vector to measure implies a change of focus in the notion of convergence we 
will consider adequate. In particular, for consistency issues, the notion of convergence we will use is weak 
convergence of probability measures. While this is the natural way to pose the problem mathematically, 
we may ask if it will allow us to gather the statistical information we are looking for. An example of the 
difficulties that arise is the following. Suppose dH p = (1 — 1/p) Si + l/pS2- In other words, the population 
covariance has one eigenvalue that is equal to 2 and {p — 1) that are equal to 1. Clearly, when p — > 00, 
Hp weakly converges to H^, with dH^ = Si. So all information about the large and isolated eigenvalue 
2, which is present in H p for all p and is naturally of great interest in PCA, seems lost in the limit. This 
is not the case when one does asymptotic at fixed spectral distribution and consider that we are following 
a sequence of models which are going to infinity with H p = H po = H^, where po is the p which is given 
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by the data set. Fixed distribution asymptotics is more akin to what is done in classical statistics and 
we place ourselves in this framework. We refer the reader to 13. 3, HI for a more detailed justification of our 
point. 

In other respects, associating a measure to a vector in the way we described is meaningful mostly when 
one wants to have information about the whole set of values taken by the coordinates of the vector, and not 
about each coordinate. In particular, when going from vector to measure as described above we are losing 
all coordinate information: permuting the coordinates would drastically change the vector but yield the 
same measure. However, in the case of vectors of eigenvalues, since there is a canonical way to represent 
the vector (the i-th largest eigenvalue occupying the i-th. coordinate), the information contained in the 
measure is sufficient. This measure approach is especially good when we are not focused on getting all the 
fine details of the vectors right, but rather when we are looking for structural information concerning the 
values taken by the coordinates. 

An important area of random matrix theory for sample covariance matrices is concerned with under- 
standing the properties of F p as p (and n) go to oo. A key theorem , which we review later (see Theorem 

states that for a wide class of sample covariance matrices, F^, the limit of F p , is asymptotically non- 
random. Furthermore, the theorem connects to H^, the limit of H p : given -Hoo> we can theoretically 
compute -Fqo, by solving a complicated equation. In data analysis, we observe the empirical spectral distri- 
bution, F p . Our goal, of course, as far as eigenvalues are concerned, is to estimate the population spectral 
distribution, H p . Our method will "invert" the relation between Fao and H^, so that we can go from F p 
to H p , an estimate of H p . The method does not work directly with F p but with a tool that is similar in 
flavor to the characteristic function of a distribution: the Stieltjes transform of a measure. We introduce 
this tool in the next subsection. As we will see later, it will also play a key role in our algorithm. 



2.2 The Stieltjes transform of measures 

A large number of results concerning the asymptotic properties of the eigenvalues of large dimensional 
random matrices are formulated in terms of limiting behavior of the Stieltjes transform of their empirical 
spectral distributions. The Stieltjes transform is a convenient and very powerful tool in the study of the 
convergence of spectral distribution of matrices (or operators), just as the characteristic function of a 
probability distribution is a powerful tool for central limit theorems. Most importantly, there is a simple 
connection between the Stieltjes transform of the spectral distribution of a matrix and its eigenvalues. 

By definition, the Stieltjes transform of a measure G on M. is defined as 



m G (z) = [ , for z G C + , 

J x - z 



where C + = CP|{z : Im(z) > 0} is the set of complex numbers with strictly positive imaginary part. 
The Stieltjes transform appears to be known under several names in different areas of mathematics. It is 
sometim es referred to as Cauchy or Abel- Stiel t ies tr ansform. Good references about Sti eltjes transforms 
include dAkhiezerl. Il 96,4 S ections 3.1-2), (|Laxl . I200I Chapter 32), (|Hiai and PetyJ . 1200(1 Chapter 3) and 



( Geronimo and Hil 



Ljims< 

ilH2003). 



For the purpose of this paper, where will consider only compactly supported measures, the following 
results will be needed: 

Fact. Important properties of Stieltjes transforms of measures on R: 

1. If G is a probability measure, mc(z) £ C + if z £ C + and lim^oo —iymciiy) = 1- 

2. If F and G are two measures, and if mp{z) = mciz), for all z £ C + , then G = F, a.e. 

3. \Geronimo and, Hill \20(v\ . Theorem 1): If G n is a sequence of probability measures and mG n (z) 
has a (pointwise) limit m(z) for all z G C + , then there exists a probability measure G with Stieltjes 
transform mc = m if and only i/lim^oo —iym(iy) = 1. // it is the case, G n converges weakly to G. 

4- \Geronimo and Hill. \200A Theorem 2): The same is true if the convergence happens only for an 
infinite sequence {z{}^ =1 in C + with a limit point in C + . 
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5. If t is a continuity point of the cdf of G, dG(t)/dt = lim e ^o ^I T n{ m G{t + 
For proofs, we refer the reader to \Geronimo and Hill. \200A) . 

Note that the Stieltjes transform of the spectral distribution T p of a p x p matrix A p is just 

m r p (z) = ^trace ((A p - zld p ) _1 ) . 

Finally, it is clear that points 3 and 4 above can be used to show convergence of probability measures if 
one can control the corresponding Stieltjes transforms. 

2.3 A fundamental result: the Marcenko-Pastur equation 

In the study of covariance matrices, a remarkable result exists that describes the limiting behavior 
of the empirical spectral distribution, Fqo, in terms of the limiting behavior of the population spectral 
distribution, H^. The connection between these two measures is made through an equation that links 
the Stieltjes transform of the empirical spectral distribution to an integral against the population spectral 
distributi on. We call this equation the Marcenko-Pastur equation because it first appe ared in the land mark 



paper of ( Marcenko and Pasturl. 19 671) . Th e res ult was independe ntly re-discovere d in dWachter L 1$78) and 
then refined in (|Silverstein and Bail . 19951 ) and ( Silversteirl Il995h . In particular, (jSilverstein . Il995h is the 



only paper where the case of a non-diagonal population covariance is tackled. 

In what follows, we will be working with an n x p data matrix X. We call S p = X*X/n and denote 
mp v the Stieltjes transform of the spectral distribution, F p , of S p . We will call vf p the function defined by 
vf (z) = (1 —p/n)^- + ^mF p (z). vf p is the Stieltjes transform of the spectral d istribution of XX* / 



\z) = {I — p/n)-f- + ^mF p {z). vf p is tne Stieltjes transform or tne spectral d istribution ol A A jn. 
Currently, the most general version of the result is found in (|Silversteinl . ll99.^ and states the following: 



1/2 

Theorem 1. Suppose the data matrix X can be written X = YT, P , where S p is a p x p positive definite 
matrix and Y is an n x p matrix whose entries are i.i.d (real or complex), with E(Yij) = 0, E(\Yij\ 2 ) = 1 
and E(\Y id \ 4 ) < oo. 

Call H p the population spectral distribution, i.e the distribution that puts mass 1/p at each of the 
eigenvalues of the population covariance matrix, T, p . Assume that H p converges weakly to a limit denoted 
Hoo- (We write this convergence H p i^oo-J Then, when p,n — > oo, and p/n — ► 7, 7 G (0, oo), 

1. vf p (z) — > v 00 (z), a.s, where Voo(z) is a deterministic function 

2. Voo(z) satisfies the equation 

XdH OQ (X) w _^ + 



,VzeC + (M-P) 



Voo(z) ' J l + Xv^z) 

3. The previous equation has one and only one solution which is the Stieltjes transform of a measure. 

In plain English, under the assumptions put forth in Theorem ^ the spectral distribution of the 
sample covariance matrix is asymptotically non-random. Furthermore, it is fully characterized by the true 
population spectral distribution, through the equation (|M-P|) . 

A particular case of equation ()M-P|) is often of interest: the situation when all the population eigenvalues 
are equal to 1. Then of course, H p = H^ = 5\. A little bit of elementary work leads to the well-known 
fact in random matrix theory that the empirical spectral distribution, F p , converges (a.s) to the Marcenko- 
Pastur law, whose density is given by, if 7 < 1, 

/ 7 (x) = y / (b-x)(x-a)/{2Trxj) , with a = (1 - 7 1/2 ) 2 , b = (1 + 7 1 / 2 ) 2 . 

We refer the reader to ( Marcenko and Pasturl Il96^ ) . (jBail . Il999) and ( Johnstonel 200 lh for more details 



and explanations concerning the case 7 > 1. One point of statistical interest is that even though the 
true population eigenvalues are all equal to 1, the empirical ones are now spread on the interval [(1 — 
7 1 / 2 ) 2 , (1 + 7 1 / 2 ) 2 ]- Plotting the density also shows that its shape vary with 7 in a non-trivial way. These 
two remarks illustrate some of the difficulties that need to be overcome when working under "large n, large 
p" asymptotics. 
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3 Algorithm and Statistical considerations 



3.1 Formulation of the estimation problem 

A remarkable feature of the equation (|M-P|) is that the knowledge of the limiting distribution of the 
eigenvalues in the population given by Hoq fully characterizes the limiting behavior of the eigenvalues of 
the sample covariance matrix. However, the relationship between the two is hard to disentangle. As is 
common in statistics, the question is how to invert this relationship to estimate H p . The question thus 
becomes, given l\,...,l p , the eigenvalues of a sample covariance matrix, can we estimate the population 
eigenvalues, Ai, . . . , A p , using Equation ()M-P|) ? Or in terms of spectral distribution, can we estimate H p 
from F p ? 

Our strategy is the following: 1) the first aim is to estimate the measure appearing in the Marcenko- 
Pastur equation. 2) Given an estimator, -ffoo, of this measure, we will estimate Aj as the i-th quantile of 
our estimated distribution. It is common in statistical practice to get these estimates by using the i/(p + 1) 
percentile and this is what we do. (We come back to possible difficulties getting from H p to Aj in IH.H.fil ) 
3) An important point is that since we are considering fixed distribution asymptotics, our estimate of 
will serve as our estimate of H p , so H p = Hoq. 

The main question, then, is how to approach step 1: estimating based only on F p . Of course, 
since we can compute the eigenvalues of S p , we can compute vf p {z) for any z we choose. By evaluating 
vf p at a grid of values we have a set of values {vF p (zj)}j=i for which equation ()M-P|) should 

(approximately) hold. We want to find that will "best" satisfy equation (jM-P|) across the set of values 
of VF p (zj). In other words, we will pick 

~ ( { 1 P [ XdH(X) ) Jn \ 

H p = Hoc = argmm L < — - + zj ' 

TT I I ))D 17:1 J , , , -r- „,., 



H \\ v F p {zj) nj l + \v Fv (z_ 



where the optimization is over probability measures H, and L is a loss function to be chosen later. In this 
way we are "inverting" the equation ()M-P|) . going from F p , an estimate of F^, to an estimate of -ffoo- 

We will solve this inverse problem in two steps: discretization and convex optimization. We give a 
high-level overview of our method and postpone implementation details to the Appendix. 

To summarize, we face the following interpolation problem: given J an integer and (zj,VF v {zj))j =l we 
want to find an estimate of that approximately satisfies equation (|M-PI) . In Section [51 we show that 
doing so for loss function leads to a consistent estimator of ff^, under the reasonable assumption that 
all spectra are bounded. 

3.2 The algorithm 

In order to alleviate the notations, we will replace the notation by H when it does not cause any 
confusion. 



3.2.1 Discretization 

Naturally, dH can be simply approximated by a weighted sum of point masses: 

K 

dH(x) ~ ^w k 5 tk (x) , 

k=l 

where {tk} k= i is a grid of points, chosen by us, and w^s are weights. The fact that we are looking for a 
probability measure imposes the constraints 

K 

Wk = 1 , and Wk > . 

k=l 
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This approximation turns the optimization over measures problem into searching for a vector of weights 
in . After discretization, the integral in equation (jM-P|) can be approximated by 

XdH{X) t k 



f XdH(X) 

J 7W2>I 



k =i - + tkV 



Hence finding a measure that approximately satisfies Equation ()M-P|) is equivalent to finding a set of 
weights {wk}f = i, for which we have 



K 



Voo(zj) n k^l 1 + tkV 



Naturally, we do not get to observe Voo, and so we make a further approximation and replace Voq by 
vp . Our problem is thus to find {^fcjfLi such that 



K 



~ Z4 > Wh : ,V7 

- k l + t k v F J Zi ) J 



V F P {zj) U t^\ 1 + tkVF P ( Zj 

One good thing about this approach is that the problem we now face is linear in the weights, which 
are the only unknowns here. We will demonstrate that this allows us to cast the problem as a relatively 
simple convex optimization problem. 

3.2.2 Convex Optimization formulation 

To show that we can formulate our inverse problem as a convex problem, let us call the approximation 
errors we make 

1 K * 

+ Zj 



v Fp (zj) nf^ l + v Fp {zj)t k 

As explained above, there are two sources of error in ey. one comes from the discretization of the integral 
involving H^. The other one comes from the substitution of v^, a non-random and asymptotic quantity, 
by vf p , a (random) quantity computable from the data, ej is of course a complex number in general. 

We can now state several convex problems as approximation of the inversion of the Marcenko-Pastur 
equation problem. We show in Section |S] consistency of the solution of the "^oo" version of the problem 
described below. Here are a few examples of convex formulations for our inverse problem. In all these 
problems, the w^s are constrained to sum to 1 and to be non-negative. 

1. "Lqo" version: Find w k s to 

Minimize max max { | Re (e» ) | , | Im (ej ) | } 
j=i,...,J n 

2. "L2" version: Find w^'s to 

J71 

Minimize | ej \ . 

i=i 

3. "L2-squared" version: Find Wk's to 

Jn 

Minimize | ej \ 2 . 
i=l 

The advantages of formulating our problem as a convex optimization problem are many. We will come 
back to the more statistical issues later. From a purely numerical point of view, we are guaranteed that an 
opti mum exists, and fast algorithms are available. In practice, we used the optimization package MOSEK 
(see (|M0SEKL mm)), within Matlab, for solving our problems. 
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Because the rest of the article focuses particularly on the "Loo" version of the problem described above, 
we want to give a bit more details about it. The "translation" of the problem into a convex optimization 
problem is 

min u 

(wi,...,wk,u) 

Vjf, — u < Re (ej) < u 
Vj, — u < Im (ej) < u 

K 

subject to w k = 1 

i=i 

and Wk > 0, Vfc 



This is a linear program (LP) with unknowns (wx, . . . , wk) and u (see ( Bovd and VandenbergheL 20041 ) for 
standard manipulations to make it a standard form LP). 

The simulations we present in Section 0] were made using this version of this algorithm. The proof in 
Section |S] applies to this version of the algorithm. 

3.3 Statistical considerations 

The formulation we proposed is quite flexible and has several important qualities. For instance, reg- 
ularization constraints can be easily handled through our proposal. We also can view the algorithm as a 
form of "basis pursuit" in measure space, from which we can draw some practical conclusions. 

3.3.1 Regularization and constraints 

Methods to invert the Marcenko-Pastur equation should be flexible enough to accommodate reasonable 
constraints that could provide additional improvement to our estimate of H p . The fact that we essentially 
just optimize over the weights w^'s mean that we can easily regularize and add constraints. For instance, 
we might want to regularize our estimator and make it smoother by adding a total variation penalty (on 
the life's) to our objective function. In terms of constraints, we might want to specify that the first moment 
of our estimate H p match the trace of S piv, since we know that the trace of S p /p is a good estimate of 



the trace of S p /p (see e.g (|JonssonL 119821 )'). and that the trace of T, p /p is equal to the first moment of H p . 



Note that constraints on the moments of our estimator are linear in the w^s and so such constraints would 
still lead to a convex problem. The framework we provide can very easily incorporate these two examples 
of penalty and constraints, as well as many others. 

3.3.2 A "basis pursuit" point of view 

A semantic point is needed before we start our discu ssion. We use the term "basis pursuit" in a loose 
sense: we are not referring to the algorithm proposed in (C hen et al. , 1998) but rather use this expression 



as a generic term for describing techniques that aim to optimize the rep resentations of functional objects 
in overcomplete dictionaries. We refer the reader to (Ha stie et m [2001, Chapter 5) for some of the core 



statistical ideas of these so-called basis expansion methods. 

The algorithm we propose can be viewed as a relaxation of a measure estimation problem. We want 
to estimate a measure and instead of searching among all possible probability measures, we restrict 
our search space to mixtures of certain class of probability measures. In 13.2.11 for instance, we restricted 
the choice to mixture of point masses. In that sense, we can view it as a type of "basis pursuit" in 
probability measure space. We first choose a "dictionary" of probability measures on the real line, and we 
then decompose our estimator on this dictionary, searching for the best coefficients. Hence our problem 
can be formulated as 

N 

find the best possible weights {w%, . . . , wn} with dH = WidMi 
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where the Mj's are the measures in our dictionary. 

In the preceding discussion on discretization, we restricted ourselves to Mj's being point masses at 
chosen "grid points". Of course, we can enlarge our dictionary to include, for instance: 

1. Probability measures that are uniform on an interval: dMi(x) = l x& [ a .^dx / {bi — a^). 

2. Probability measures that have a linearly increasing density on an interval [aj,6j] and density 
elsewhere. So dMi(x) = lr a .;,.]2(x — Oj)/(6j — a,) 2 dx, and density elsewhere. 

3. Probability measures that have a linearly decreasing density on an interval [aj,6j], and density 
elsewhere. So dMi(x) = lr ai) {, i ]2(6j — x)/(bi — ai) 2 dx. 

If we decide to include a probability measure M in our dictionary, the only requirement is that we be 
able to compute the integral 

f XdM(X) 
J l + Xv 

for any v in C + . 

Choosing a larger dictionary increases the size of the convex optimization problems we try to solve, 
and hence is at first glance computationally harder. However, statistically, enlarging the dictionary may 
lead to sparser representations of the measure we are estimating, and hence, at least intuitively, lead to 
better estimates of H^. The most favorable case is of course when Hqo is a mixture of a small number of 
measures present in our dictionary. For instance, if has a density whose graph is a triangle, having 
measures as described in points 2 and 3 above would most likely lead to sparser and maybe more accurate 
estimates. In the presence of a priori information on H^, the choice of dictionary should be adapted so 
that has a sparse representation in the dictionary. 



3.3.3 Useful properties of the algorithm 

One important advantage of choosing to estimate measures instead of choosing to estimate a high- 
dimensional vector is that the algorithm's complexity does not increase with the size of the answer required 
by the user. Hence given a p dimensional vector of eigenvalues, once the values VF p (zj) are computed, the 
computational cost of the algorithm is the same irrespective of p. This means that for large p problems, 
only one difficult computation is required: that of the eigenvalues of the empirical covariance matrix. Our 
algorithm is hence, in some sense, "dimension-free", i.e, except for the computation of the eigenvalues, 
it is insensitive to the dimensionality of our original problem. This scaling property is important for 
high-dimensional problems. 

Another good property of our method is that it is independent of the basis in which the data is 
represented. Because our method requires only as input the eigenvalues of the sample covariance matrix - 
quantities obviously independent of the original basis of the data - our method is basis independent. 

In other respects, Theorem^holds for random variables that have a 4-th moment; we are not limited to 
Gaussian random variables. Complex random variables are also possible. Hence, the theorem is well-suited 
for wide applicability. Elementary properties of Gaussian random variables show that Theorem ^ covers all 
possible Gaussian problems. This will not be true for all distributions, but the scope of the theorem is still 
very wide. Note also that the Equation (|M-PI) holds in greater genera lity than mentioned in Theorem ^ 
We refer the reader to the original paper ( Marcenko and PasturlTl967l l for further examples, in particular 



when the data is distributed on spheres or ellipsoids. (The original formulation of the theorem allows for 
dependence between the entries of the matrix Y, but the convergence is not shown to be almost sure.) 



3.3.4 The case p > n and how large is large? 

Another advantage of the proposed method is that it is insensitive to whether p is larger than n or n is 
larger than p. The only requirement is that they both be quite large. We had reasonable to good results in 
simulation as soon as p > 30 or so. As a matter of fact, it is quite clear that to have reasonably accurate 
estimates of the eigenvalues, we need to "populate" the interval [X p , X%] with enough points, for otherwise 
quantile methods may be somewhat inaccurate. 
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3.3.5 On covariance estimation, linear and non-linear shrinkage of eigenvalues 



There is some classical and more recent statistical work on shrinkage of eigenva lues to improve co- 
variance estimation. We refer the reader to Section 4.1 in (jLedoit and W olf. 2004) for some examples 
due to Charles Stein and Leonard Ha f f, unf ortunately in unpublished manuscripts. More recently, in the 
interesting paper by dLedoit and WolA I2004T ) . what was proposed is to linearly shrink the eigenvalues of S p 
toward the identity : i.e k's become k = (1 — p)k + p' s, for some p, independe nt of i, chosen using the data 
and the Marcenko-Pastur law. Then the authors of ( Ledoit and Wold . 12004^ proposed to estimate S p by 
(1 — p)S p + pldp. Since this latter matrix and S p have the same eigenvectors, their method of covariance 
estimation can be viewed as linearly shrinking the sample eigenvalues and keeping the eigenvectors of S p 
as estimates of the eigenvectors of T, p . 

Our method of estimation of the population eigenvalues can be viewed as doing a non-linear shrinkage 
of the sample eigenvalues. While we could propose to just keep the eigenvectors of S p as estimates of 
the eigenvectors of £ p , and hence get an estimate of the population covariance matrix, we think one 
should be able to do better by using the eigenvalue information to drive the eigenvector estimation. It is 
known that in "large n, large p" asymptotics, the eigenv ectors of the sam ple covariance matrix are not 
consistent estimators of the population eigenvectors (see (|Pa,ull . lTo Anneal , even in the most favorable 
cases. However, having a good idea of the structure of the population eigenvalues should help us estimate 
the eigenvectors of the population covariance matrix, or at least formulate the right questions for the 
problem at hand. For instance, the inferred structure of the covariance matrix could help us decide how 
many subspaces we need to identify: if, for example, it turned out that the population eigenvalues were 
clustered around two values, we would have to identify two subspaces, the dimensions of these subspaces 
being the number of eigenvalues clustered around each value. Also, having estimates of the eigenvalues 
tell us how much variance our "eigenvectors" will have to explain. In other words, our hope is that taking 
advantage of the crucial eigenvalue information we are now able to gather will lead to better estimation of 
E p by doing a "reasoned" spectral decomposition. Work in this direction is in progress. 



3.3.6 Asymptotics at fixed spectral distribution and isolated eigenvalues 

Our algorithm actually uses asymptotics assuming a fixed spectral distribution: we are essentially fixing 
H p = H^ when solving our optimization problem. Naturally, this does not mean that p is fixed. Note that 
this is what is classically done is statistics: for the simple problem of estimating the mean of a population 
from a sample Z\, . . . , Zjc, it is common to assume that the Z^s have the same mean p, and that p does 
not depend on K. However, when studying the asymptotic properties of this simple estimator, we could 
require to actually have p(K), with p(K) —* p. (All we would have to do is have a triangular array of 
data, and getting to observe just one row of this array at a time.) Hence our fixed spectral distribution 
"assumption" is very natural and similar to classical assumptions made in estimation problems. 

Let us go back now to the problem of isolated eigenvalues. Suppose we get to see data in MP for 
some pq. Then, any isolated eigenvalue that may be present is numerically treated as if the mass that 
is attached to it is held fixed at 1/po when p — > oo. So a point mass at the corresponding population 
eigenvalue would appear in H p . This has been verified numerically. If the estimator were perfect, this 
mass should be equal to 1/po- However, because of variability it may not be exactly of mass 1/po- Then, 
estimating the population eigenvalues by the quantiles of the estimated population spectral distribution, 
we may "miss" this isolated eigenvalue. In the case of the largest eigenvalue, that would happen if the mass 
found numerically at this isolated eigenvalue is less than l/(po + 1). So isolated eigenvalues will require 
special care and caution, particularly in going from H p to Aj. While the method focuses on identifying the 
structure of the population eigenvalues and hence may have problems when it comes to estimating isolated 
eigenvalues, we have found in practice that it still provided a good tool for this task but that some care 
was required. 



3.3.7 Existing related work 

As far as we know, there has been no work on non-parametric estimation of H„ or H^ using the 
Marcenko-Pastur equation. However, some work exists in the Physics' literature (fB urda et al. 
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2005n l. that takes advantage of the Marcenko-Pastur law to estimate some momen ts of JJ^, H, 



assumed to a be a mixture of a finite and pre-specified number of point masses (see ([Burda et al 



is th en 
120041 p. 

303)) and the moments are then matched with possible point masses and weights. While these methods 
might be of some use sometimes, we think they require too many assumptions to be practically acceptable 
for a broad class of problems. It might be tempting to try to devel op an non-parametr ic estimator from 
moments, but we think that without the strong assumptions made in ( Burda et al. . 2004h . those estimators 
will suffer drastically from: 1) the number of moments needed a priori may be large, and large moments 
are very unreliable estimators; 2) moments estimated indirectly may not constitute a genuine family of 
moments: certain Hankel matrices need to be positive semi-definite and will not necessarily be so. Semi- 
definite programming type corrections will then be necessary, but hard to implement. 3) Even if one has a 
genuine moment sequence, there are usually many distributions with the same moments. Choosing between 
them is clearly going to be a difficult task. 



4 Simulations 

We now present some simulations to illustrate the practical capabilities of the method. The objectives 
of eigenvalues estimation are many- folds and depend of the area of applications. We review some of those 
that inspired our work. 

In settings like PCA, one basically wishes to discover some form of structure in the covariance matrix by 
looking at the eigenvalues of the sample covariance matrix. In particular, a situation where the population 
eigenvalues are different from each other indicates that projecting the data in some projections will be more 
"informative" that projecting it in other directions; while in the case where all the population eigenvalues 
are equal, all projections are equally informative or uninformative. As our brief discussion of the Marcenko- 
Pastur law illustrated, in the "large n, large p" setting, it is difficult to know from the sample eigenvalues 
whether all population eigenvalues are equal to each other or not, or even if there is any kind of structure 
in them. When p and n are both large, standard graphical methods like the scree plot tend to look similar 
whether or not there is structure in the data. We will see that our approach is able to differentiate between 
the situations. Among other things, our method can thus be thought as a alternative to the scree plot for 
high-dimensional problems. 

In other applications, one focuses more on trying to estimate the value of the largest or smallest 
eigenvalues. In PCA, the largest population eigenvalues measure how much variance we can explain through 
a low dimensional projection and is hence important. In financial applications, like the Markovitz' portfolio 
optimization problem, the small population eigenvalues are important. They ess entially measure wh at is 
the minimum risk one can take by investing in a portfolio of certain stocks (see ( Laloux et al. , 1999^ and 



( Campbell et al Chapter 5)). However, as explained in the Appendix, the largest eigenvalue of the 



sample covariance matrix tends to overestimate the largest eigenvalue of the population covariance. And 
similarly, the smallest eigenvalue of the sample covariance matrix tends to underestimate its population 
counterpart. What that means is that using these measures of "information" and "risk", we will tend to 
overestimate the amount of information there is in our data and tend to underestimate the amount of risk 
there is in our portfolios. So it is important to have tools to correct this bias. Our estimator provides a 
way to do so. 



4.1 Details of the simulations 

We illustrate the performance of our method on three cases, each with very different covariance struc- 
ture. We will give more details on each individual case in the following subsections. 

We now describe more precisely these examples. The first case is that of S p = Id p , in other words, 
there is no "information" in the data. However standard graphical statistical methods like the "scree plot" 
will tend to show a pattern in the eigenvalues. We will show that our method is generally able to inform 
us that all the eigenvalues are equal. 

The second case is one where S p has 50% of its eigenvalues equal to 1 and 50% equal to 2. While 
it should be easy to discern that there are two very distinct clusters of eigenvalues in the population, in 
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high-dimension the sample eigenvalues will often blur the clusters together. We show that our method 
generally recovers these two clusters well. 

Finally, the third example is one where S p is a Toeplitz matrix. More details on Toeplitz matrices 
are given in 14.1.31 This situation poses a harder estimation problem. While the asymptotic behavior of 
the eigenvalues of such matrices is well understood, there are generally no easy and explicit formulas to 
represent the limit. We present the results to show that even in this difficult setting, our method performs 
quite well. 

To measure the performance of our estimators, we compare the Levy distances between our estimator, 
Hp, and the true distribution of the population eigenvalues, H p , to that of the empirical spectral distribu- 
tion, F p , to Hp. Our choice is motivated by the fact that the Levy distance can be used as a metric for 
weak convergence of distributions on R. Recall (see e.g 
two distributions F and G on the real line is defined as 

d L (F, G) = inf{e > : F(x - e) - e < G(x) < F(x + e) + e , Vx} . 

In the plots we will depict the cumulative distribution function (cdf ) of our estimated measures. Recall 
that the estimates of the population eigenvalues Aj's are obtained by taking appropriate percentiles of these 
measures. 

4.1.1 The case S p = Id p 

In this situation, the Marcenko-Pastur law predicts that instead of being concentrated at 1 like the 
population eigenvalues, the sample eigenvalues will be spread on the interval [(1 — yp/n) 2 , (1 + yp/n) 2 ]. 
This is problematic, since by looking at the scree plot of just the sample eigenvalues, one might think 
that some population eigenvalues are (much) larger than others and hence some projections of the data 
are more informative than others. This is vividly illustrated on Figure Hal However, as we see on Figure 
ITcl the method we propose finds that the population spectral distribution is very close to a point mass 
at 1, and all eigenvalues are thus close to 1. Statistically, this of course means that there is no preferred 
direction to project the data. All directions are equally informative, or uninformative. 

The figures presented in Figure n were chosen at random among 1000 Monte-Carlo simulations and are 
very encouraging. To further our empirical investigation of the performance of our method, we repeated 
the estimation process 1000 times. Another advantage is that on further investigation (manually checking 
the graphs of many of the estimators we obtained) we saw that the estimator consistently gets the structure 
"right", namely a huge spike in the vicinity of 1. This is of course very important for applications such 
as PCA, where the structure of the spectrum of the covariance matrix is of fundamental importance. For 
each repetition, we estimated the distribution of the eigenvalues in the population, and computed the Levy 
distance of our estimator, H p , to the true distribution, H p , in this case a point mass at 1. We did the 
same for the empirical spectral distribution F p . Figure shows the ratio d^{Hp, H p ) / d,L(F p , H p ) for these 
simulations. Our estimator clearly outperforms the one derived from the sample covariance matrix, often 
by a dramatic factor. 

4.1.2 The case H p = .5^ + M 2 

In this case the eigenvalues of the population covariance matrix are split into two clusters of equal size. 
For the specific example we investigate, 50% of the eigenvalues are equal to 1 and 50% are equal to 2. 

While it should be easy to discern that there are two very distinct clusters of population eigenvalues, 
when p is sufficiently close to n the two clusters merge together and the scree plot of the sample eigenvalues 
does not show a clear separation between the two regions. The Marcenko-Pastur law predicts (in the case 
of identity covariance) that the sample eigenvalues spread over larger and larger intervals as p gets closer to 
n. Therefore, it is intuitively not surprising that when we have two not too distant clusters of population 
eigenvalues, the corresponding sample eigenvalues would start to overlap if p is close enough to n. 

We did a Monte Carlo analysis (similar to the one done in the case of Id p covariance) of our estimator 
and did comparisons to the empirical spectral distribution. As in the case of Id p , we present a figure 
showing the ratio of the Levy distance of the two estimates to the true distribution. Figure 0] shows that 



Du rrettl . 1 996^ that the Levy distance between 
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Scree Plot eigenvalues sample covariance matrix 



CDF eigenvalues sample covariance matrix 
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p=1u0 
cov =id 
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p=100 



(a) Eigenvalues (scree plot) of the sample covariance matrix (b) CDF eigenvalues, sample covariance matrix (F p ) 

Estimated CDF eigenvalues population covariance 
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(c) CDF eigenvalues, estimated population covariance ma- 
trix (Hp) 

Figure 1: case S p = Id p . The three figures above compare the performance of our estimator to the one 
derived from the sample covariance matrix on one realization of the data. The data matrix X is 500 x 100. 
All its entries are iid AA(0, 1). The population covariance is S p = Idioo, so the distribution of the eigenvalues 



is a point mass at 1. This is what our estimator (Figure (c) ) recovers. Average computation time (over 
1000 repetitions) was 13.33 seconds, according to Matlab tic and toe functions. Implementation details 
are in the Appendix. 
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Ratio of Levy 
Metrics 

Empirical/ Adjusted 
1000 repetitions 



n=500 
p=100 



Figure 2: case £ p = Id p : Ratios cIl(H p , H p ) / cIl(F p , H p ) over 1,000 repetitions. Dictionary consisted of 
only point masses. Large values indicate better performance of our algorithm. All ratios were found to be 
larger than 1. 
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Scree plot eigenvalues sample covariance matrix 



CDF eigenvalues sample covariance matrix 



n=500 
p=100 

50 eigenvalues =1 
50 eigenvalues =2 



(a) Scree plot of eigenvalues, sample covariance matrix: no 
clear separation around the 50th eigenvalue 



n=500 
0.8 - p=500 

50 eigenvalues =1 
50 eigenvalues =2 



(b) CDF eigenvalues sample covariance matrix (F p ) 
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(c) Estimated CDF of eigenvalues of population covariance 
matrix (H p ) 

Figure 3: case H p = .5(5i + .5^2 : the three figures above compare the performance of our estimator on 
one realization of the data. The data matrix Y is 500 x 100. All its entries are iid JV(0, 1). The covariance 
is diagonal and has spectral distribution H p = .5<5i + .552 • in other words, 50 eigenvalues are equal to 1 



and fifty eigenvalues are equal to 2. This is essentially what our estimator (Figure (c) ) recovers. Average 
computation time (over 1000 repetitions) was 15.71 seconds, according to Matlab tic and toe functions. 



Ratio Levy metrics 

2 point {1 ,2) covariance 




Figure 4: case H p = .5<5i + .5<52 : Ratios dL(Hp,H p )/dL(F p ,Hp) over 1,000 repetitions. Dictionary 
consisted of only point masses. Large values indicate better performance of our algorithm. All ratios were 
found to be larger than 1. 
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Scree plot eigenvalues sample covariance matrix Empirical and Population spectral distributions 
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(a) Scree plot, Eigenvalues sample covariance matrix (b) CDF eigenvalues sample covariance matrix (F p ) 
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Toeplitz matrix: 
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(c) Estimated CDF of eigenvalues of population covariance 
matrix (H p ) 

Figure 5: case S p Toeplitz with entries. 3l 1— J 'l : the three figures above show the performance of our 
estimator on one realization of the data. The data matrix Y is 500 x 100. All its entries are iid M(0, 1). The 
covariance is Toeplitz, with t(\i — j\) = .3^~^. In Figure \{cj\ we superimpose our estimator (blue curve) 
and the true distribution of eigenvalues (red curve). Average computation time (over 1000 repetitions) was 
16.61 seconds, according to Matlab tic and toe functions. 



once again our estimator clearly outperforms the one derived from the sample covariance matrix, by a 
large factor. Again, upon further investigation, the estimator generally gets the correct structure of the 
distribution of the population eigenvalues: in this case two spikes at 1 and 2. 

4.1.3 The case of a Toeplitz covariance matrix 

Finally, we performed the same type of analysis on a Toeplitz matrix, to show that the method we 
propose works quite well on more complicated types of covariance structures. Note that generally this is 
inherently a quite difficult problem, if we do not assume a priori that we know that the matrix is Toeplitz. 

We recall that a Toeplitz matrix T is a matrix whose entries satisfy Tjj = t(i — j), for a certain function 
t. Since covariance matrices are symmetric, the Toeplitz matrices at hand will s atisfy Tj^ = t(\i — The 
limit in g spectral d istri bution of these objects are v ery well understood: see ( Bottcher and SilbermarmL 
jOravlEnd) or (jOrenander and Szeeml. FT" 



Approaches exist t hat take advantage of the particular structure of a Toeplitz matrix. See for instance, 
th e interesting papers (fBickel and Levinal . 2004h and for even more generality - beyond Toeplitz matrices 



jRickel and Levini H)- However, these approaches are very basis dependent-; they assume that the 
variables are measured in the appropriate basis. In data analysis, this may sometimes be justified and 
sometimes not. In particular, if the order of the variables is permuted, the resulting estimators might 
change. Since we want to be able to avoid this type of behavior, we feel that a "basis independent" method 
is needed and should be available. Finding such a method was one of the original motivations of our 
investigations. 

Once again, the results displayed in Figure are quite encouraging. Note that this time, the population 
spectral distribution could only be approximated by a large number of elements of our dictionary. So there 
was no sparse representation of in our chosen dictionary of measures. However, computation time was 
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Ratio Levy metrics 
Toeplitz covariance 

n=500 



Figure 6: Case S p Toeplitz with entries (.3' 1- - 7 '): Ratios di(H p , H p )/dL(F p , H p ) over 1,000 repetitions. 
Dictionary consisted of only point masses. Large values indicate better performance of our algorithm. All 
ratios were found to be larger than 1. 

not severely affected and the results are still quite good. To give a more detailed comparison, we present 
in Figure EDa histogram of ratios (Il(H p , H p ) / <1l{F p , H p ). 

5 Consistency 

In this section, we prove that the algorithm we propose leads to a consistent (in the sense of weak 
convergence of probability measures) estimator of the spectral distribution of the covariance matrices of 
interest. 

More precisely, we focus on the "Loo" version of the algorithm proposed in lH.2.21 In short, the theoretical 
results we prove state that as our computational resources grow (both in terms of size of available data and 
grid points on which to evaluate functions), the estimator H p converges to H^. The meaning of Theorem 
121 which follows, is the following. We first choose a family of points {zj} in the upper-half of the complex 
plane, with a limit point in the upper-half of the complex plane. We assume that the population spectral 
distribution H p has a limit, in the sense of weak convergence of distributions, when p — > do. We call this 
limit Hoq. This assumption of weak convergence allows us to vary H p , as p grows, and to not be limited 
to H p = Hqo for the theory; this provides maximal generality. We then solve the "Loo" version of our 
optimization problem, by including more and more of the Zj's in the optimization problem as n —* oo. 
We assume in Theorem [21 that we can solve this problem by optimizing over all probability measures. 
Then Theorem shows that the solution of the optimization problem, H p , converges in distribution to the 
limiting population spectral distribution, H^. In Corollary ^ we show that the same conclusion holds if 
the optimization is now made over probability measures that are mixture of point masses, whose locations 
are on a grid whose step size goes to with p and n. Actually, the requirement is that the dictionary of 
measures we use contain these diracs. It can of course be larger. Hence, Corollary ^ proves consistency 
of the estimators specifically obtained through our algorithm. Beside the assumptions of Theorem ^ we 
assume that all the spectra of the population covariances are (uniformly) bounded. That translates into 
the mild requirement that the support of all H p s be contained in a same compact set. Note that in the 
context of asymptotics at fixed spectral distribution, this is automatically satisfied. 

We now turn to a more formal statement of the theorem. The notation B(zq, r) denotes the closed ball 
of center zq and radius r. Our main theorem is the following. 

Theorem 2. Suppose we are under the setup of Theorem^ H p and p/n — > 7, with < 7 < 00. 

Assume that the spectra of the S p 's are uniformly bounded. Let J\, J2, . . . , be a sequence of integers tending 
to 00. Let zq G C + and r £ R + be such that B(zQ,r) C C + . Let z\, Z2, ■ ■ ■ be a sequence of complex variables 
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with an accumulation point, all contained in B(zq,t). Let H p be the solution of 



Hp = argminmax 



■ P 

+ Zj 



XdH(X) 



VF p {zj) 3 nj l + Xv Fp ( Zj ) 
where H is a probability measure. Then we have 



(1) 



Hp => Hqo , a.s . 

Before we turn to proving the theorem, we need a few intermediate results. An important step in the 
proof is the following analytic lemma. 

Lemma 1. Suppose we have a family {zi}^ 1 of complex numbers in C + , with an accumulation point in 
C + . Suppose there exist a sequence {Ji} c *l l of integers tending to oo, a sequence {ej}^ 1 of positive reals 
tending to 0, a sequence {p(n)}^ =1 of integers, with p(n)/n — > 7 G M*,, and a sequence of probability 
measures {H p } p x L 1 such that 



Vj < J n , 



V Fp {Zj 



1 p 

— + z j-- 



n 



XdHp(X) 



Assume that Vqo satisfies 



1 + Xvp^Zj) 

XdHpojX) 
1 + Xv 00 (z j ) 



(2) 



(3) 



Voo(zj) 

for some probability measure H^. Assume that VF p (zj) — ► v !X (zj), and both are analytic in C + and from 
C + to C + . Further, assume that Iv^Zj)] < C for some C 6 R, and \Im(vp p (zj)) \ > 5, as well as 
\Im(voo{zj)) I > 5, for some 5 > 0. Then 

H„ =>• H^ ■ 



Proof. Since Voo satisfies 

equation @ reads 

1 1 



VF p (Zj) VooiZj) 



7 

n 



Voo(Zj) + Zj 7 

p\ f XdH QO (X) 



0. 



+ P - 

l + Xv^Zj) n 



XdHccjX) 

1 + XVoo(Zj) 

XdH^X) 



XdHp(X) 
1 + XVoo (Zj) J 1 + Xv Fp (Zj ) 



< e n 



Note that since |Im (v Fp ) \ > 5 and |Im (voo) \ > S, and given that 



< 



\VF P ~ «oo| 



|Im (v Fp ) Hlm^oo) I 



we have \1/vf p - l/«oo| —* 0. 

Also, because p/n — > 7, the previous equation implies that 



XdH^jX) 

1 + XVooizj] 



Now because VF p (zj) — ► Uoo(zj), we have 



XdHp(X) 



XdHp(X) 
l + Xv Fp (zj) 



. 



Acitf p (A) 

1 + ^Vp p (Zj) J l + XVoo(Zj) 



A 2 (foo(%-) - VF p (zj))dH p (X) 



< 



(1 + XVooiZj))^ + Xv Fp ( Zj )) 
\ v F p (zj) -Voo(Zj)\ 

|Im (v Fp {zj)) \\Im (v^Zj)) \ 



. 



18 



So we have 

XdH p (X) f XdH^X) 



l + XVoo(Zj) J l + Xv^Zj) 

We remark that for m E C + , and G a probability measure on R, whose Stieltjes transform is denoted by 



XdG(X) 1 1 f dG(X) 1 1 f dG(X) 1 1/1 

n&G 



Sg, 

I , , , ■ 

J 1 + Am m m J 1 + Am m m 2 J 1/m + A m m 2 \ m 
Hence, when the assumptions of the lemma are satisfied, we have 

S S P ( —I -» S ^ 



Now since foo(^j) satisfies Equation Q, we see that if Voo(^j) = ^oo(^jfc)) then Zj = z^. Hence, {— l/v 00 (zj)y*L 1 
is an infinite sequence of complex numbers in C + . Moreover, because Voo is analytic in C + , it is continuous, 
and so {— ^-/v 00 (zj)}^ x has an accumulation point. Further, because |t>oo(zj)| < °o and lm(v ocl (zj)) > 5, 
this accumulation point is in C + . 

So under the assumptions of the lemma, we have shown that there exist an infinite sequence {yj}^Li 
of complex numbers in C + , with an accumulation point in C + , such that 

s s p (yj) -» s Hjyj) ,vj . 

According to ( Geronimo and Hill 2003h . Theorem 2, this implies that 

Hp Hqo . 

□ 

In the context of spectrum estimation, the intuitive meaning of the previous lemma is that if for a 
sequence of complex numbers {zj}JL 1 with an accumulation point in C + , we can find a sequence of H p 7 s 
approximately satisfying the Marcenko-Pastur equation at more and more of the z^s when n grows, then 
this sequence of measures will converge to . 

We now state and prove a few results that will be needed in the proof of Theorem |5J The first one is a 
remark concerning Stieltjes transforms. 

Proposition 1. The Stieltjes transform, Sh, of any probability measure H on R, is Lipschitz l/u^ n on 
C + n {Im{z) > U m i n }. 

Hence, if Sh„ (z) — ► Sh^ (z) pointwise, where all the measures considered are probability measures, the 
convergence is uniform on compact subsets of C + H {Im(z) > ti m i n }. 

Proof. We first show the Lipschitz character of Sh- We have 

Sh(zi) - S H (z 2 ) = [ ( -i — L_ ) dff(A) = (Z! - z 2 ) 

J \X-zi X-z 2 J 

Now |A — Z\\ > |Im (A — z\) | > u m \ n . So 

\zi - z 2 \ 



dH(X) 



(X- Zl )(X-z 2 ) 



\S H (zi) - S H (z 2 )\ < 



u 2 ■ 

mm 



So we have shown that Sh is uniformly Lipschitz l/ii min on C + P|{ im ( z ) > u min}- 

Now, it is an elementary and standard fact of analysis that if a sequence of if-Lipschitz functions 
converge pointwise to a K-Lipschitz function, then the convergence is uniform on compact sets. This 
shows the uniform convergence part of our statement. □ 

In the proof of the Theorem, we will need the result of the following proposition. 
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Proposition 2. Assume the assumptions underlying Theorem^\are satisfied. Recall that vf is the Stieltjes 
transform of F p , the spectral distribution of XX* jn = YT, p Y* jn. Assume that the population spectral 
distribution H p has a limit and that all the spectra are uniformly bounded. Let z £ B(zQ,r), with 
B(zQ,r) C C + . Then, almost surely, 

3N, n > N => inf Im (v Fp (z)) = 5 > . 

n,2GB(zo,r) 



Proof. Since we assume that all spectra are bounded, we can assume that the population eigenvalues are 

cause the spectral norm is a matrix nori 

X max (X*X/n) < A max (X p )A max (Y*y/n) 



,1/2 

all uniformly bounded by K. Because the spectral norm is a matrix norm and X = YY> P , we have 



Now it is a standard result in random matrix theory that, A max (y*Y/n) — > (1 + 7) 2 , a.s, so for n large 
enough, 

A max (r*y/n) < 2(1 + 7 ) 2 a.s . 



Calling z = u + iv, we have 



/ vdF p (X) f vdF p {\) 

Im [y Fp {z)) - / > 



{X-u) 2 + v 2 - J 2{X 2 +u 2 ) + v 2 ' 

because v > 0. Now, the remark we made concerning the eigenvalues of X*X/n implies that almost surely, 
for n large enough, F p puts all its mass within [0, C], for some C. Therefore, 

imK(.))> c2+ ; 2+2 ^ 2 , 

and hence Im {vF p {zf) is a.s bounded away from 0, for n large enough. □ 

To show that we can find "good" probability measures when solving our optimization problem, we will 
need to exhibit a sequence of measures that approximately satisfy the Marcenko-Pastur equation. The 
next proposition is a step in this direction. 

Proposition 3. Let r 6 M + and zq £ C + be given and satisfying B{z$,r) C C + . Suppose p/n —* 7 when 
n — > 00, and \/e3N : n > N =^> Vz £ B(zQ,r), \vf p (z) — Voo(z)\ < e, where satisfies equation 
Suppose further that \Im (voo(z)) \ > u m \ n on B(zo,r). Then, if e < u mm /2, 



3N' £ N, Vz G B(z ,r), Vn > N', 



Proof. Using equation © we find that 



1 +z _P f AdF 00 (A) 



vf p (z) n J l + \v Fp {z 



1 + 27 
< 2e- 



u 2 

"mm 



A„(z) = ^- + Z 



vf p (z) ' nj 1 + Xvf p {z) 

1 +§^tt4^-tt^tt|.»/x 1 .M 



wf p (2) foo(^) W \1 + Xv 00 (z) l + Xv Fp (z) 
Because 7 — p/n — > 0, and | A/(l + Afoo(z))| < l/|Im (foo(^)) | < l/"U m i n , we have 



- -) / 1 , ^ / , rfiJoo(A) -> uniformly on B(z ,r) . 
V nJ J l + Xvoo^z) 
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Now, of course, 



,{z)-v Fp {z) p 



vf p {z)voo(z) n 
We remark that |vp > |Im (v Fp (z)) | > -u m ; n 



(v Fp (z) - V^Z)) 



(l + \v Fp (z)){l + \ VoD (z)) 
e > u m j n /2. Hence, if n is large enough 



|A^)|<2 



\Voc{z) ~ V Fp (z)\ p \v Fp ( Z ) - Voo{z)\ 



+ 2^- 

n 



< e- 



U7, 



H 7. 



(1 + 2 7 ) 



□ 



We now turn to proving Theorem [2 



Proof of Theorem According to Propositions ^ and HI the assumptions put forth in Proposition |3] 
are a.s satisfied for v Fp and Voo is the Stieltjes as in Theorem^ Note also that Theorem ^ states that a.s, 
v Fp (z) — > Voo(z), and that all these functions are analytic in C + . In other words, they have the properties 
needed for Lemma ^ to apply. 

In particular, Proposition El implies that if {zj} is a family of complex numbers included in B(zo,r), 
and if H p is the solution of equation equation © will be satisfied almost surely, with a family {ej} of 
positive real numbers that converge to 0. According to Lemma this implies that, 



H„ Hoc 1 almost surely. 



As a corollary of Theorem we are now ready to prove consistency of our algorithm. 



□ 



Corollary 1 (Consistency of proposed algorithm). Assume the same assumptions as in Theorem 
Call Hp the solution of equation ^j, where the optimization is now over measures which are sums of 
atoms, the location of which are restricted to belong to a grid ( depending on n) whose step size is going to 
as n — > 00. Then 

Hp => Hoc a.s . 

Proof. All that is needed is to show that a discretized version of Hqq furnishes a good sequence of measures 
in the sense that Proposition |21 holds for this sequence of discretized version of H^ . 

We call Hm„ a discretization of H^ on a regular discrete grid of size 1/M„. For instance, we can 
choose Hm„(x) to be a step function, with Hm„(x) = H OQ (x) is x = l/M n , I £ N, and Hm u is constant on 
[l/M n , (I + 1)/M n ). Recall also that Hoq is compactly supported. 

In light of the proof of Proposition 13 for the corollary to hold, it is sufficient to show that uniformly 
in z £ B(z ,r), 

A -dH Mn (X)- [ — — — —dH 00 (X) 



l + Xv Fp (z) 



l + Xv F (z) 



. 



Now calling dw {Hm„, Hqo) the Wasserstein distance between % n and H^, we have 



dw (Hm„ , H 



f 

Jo 



\Hm„(x) — i7oo(x)| dx — > as n — » 00 . 



{HM n and H^ put mass only o n M + , so the previous integral is restricted to M + . We refer the reader to 
the survey ( Gibbs and Sul 2001 ) for properties of different metrics on probability measures.) 

In other respects, it is easy to see that under the assumptions of Proposition |31 there exists iV~ such 
that, sup n> jv )Z eB( Z0) r) 1^(^)1 < K, for some K < 00. Recall also that under the same assumptions, 
inf n >N,zeB(z ,r) Im (v Fp (z)) > 5, for some 5 > 0. 

For two probability measures G and H, we also have 



dw {G, H) = sup 
/ 



fdG - / fdH 



; / a 1-Lipschitz function 
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Hence, because and Hm„ are supported on a compact set that is independent of n, to have the result 
we want, it will be enough to show that 

fv Fp (z)W 



l + Xv Fp (z) 



is uniformly Lipschitz (as a function of A) when z £ B(zo,r) and n > N. 
Now note that 

/^ W (Ai) - /^(z)(A 2 ) = (1 + AlUFp(z))(1 + A2 ^ (z)) • 

If A < 1/(21?), then |Auf p (z)| < 1/2, so |1 + Xv Fp (z)\ > 1/2. If A > 1/(21?), then |1 + Xv Fp (z)\ > 
Xlm(v Fp {z)) > S/(2K). So |1 + Xv Fp (z)\ > mm(l/2,S/{2K)) = C. Hence f Vp {z) is 1/C 2 -Lipschitz, and C 
is uniform in n and z, as needed. 

Having thus extended Proposition El to discretized versions of , the proof of the corollary is the 
same as that of Theorem |21 □ 

The proof of the corollary makes clear that when solving the optimization problem over any dictionary 
of probability measures containing point masses (but also possibly other measures) at grid points on a grid 
whose step size goes to 0, the algorithm will lead to a consistent estimator. 

Finally, as explained in the Appendix, the algorithm we implemented start with v Fp (zj) sequences, as 
opposed to simply Zj sequences. It can be straightforwardly adapted to handle the Zj's as a starting point, 
too, but we got slightly better numerical results when starting with v Fp (zj). The proof we just gave could 
be adapted to handle the situation where the v Fp (zj) , s are used as starting point. However, a few other 
technical issues would have to be addressed that we felt would make the important ideas of the proof less 
clear. Hence we decided to show consistency in the setting of Corollary ^ 



6 Conclusion 

In this paper we have presented an original method to estimate the spectrum of large dimensional 
covariance matrices. We place ourselves in a "large n, large p" asymptotic framework, where both the 
number of observations and the number of variables is going to infinity, while their ratio goes to a finite, 
non-zero limit. Approaching problems in this framework is increasingly relevant as datasets of larger and 
larger size become more common. 

Instead of estimating individually each eigenvalue, we propose to associate to each vector of eigenvalues 
a probability distribution and estimate this distribution. We then estimate the population eigenvalues as 
the appropriate quantiles of the estimated distribution. We use a fundamental result of random matrix 
theory, the Marcenko-Pastur equation, to formulate our estimation problem. We propose a practical 
method to solve this estimation problem, using tools from convex optimization. 

The e stimator has go od practical properties: it is fast to compute on modern computers (we use the 



ine estimator Uas good practical properties: it is last to compute on modern computers (we use tne 
software (MOSEK, 2006) to solve our optimization problem) and scales well with the number of parameters 



to estimate. We show that our estimator of the distribution of interest is consistent, where the appropriate 
notion of convergence is weak convergence of distributions. 

The estimator performs a non-linear shrinkage of the sample eigenvalues. It is basis independent and 
we hope will help in improving the estimation of eigenvectors of large dimensional covariance matrices. To 
the best of our knowledge, our method is the first that harnesses deep results of random matrix theory to 
practically solve estimation problems. We have seen in simulations that the improvement it leads to are 
often dramatic. In particular, it enables us to find structure in the data when it exists and to conclude to 
its absence where there is none, even when classical methods would point to different conclusions. 

APPENDIX 
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A.l Implementation details 

We plan to release the software we used to create the figures appearing in the simulation and data 
analysis section in the near future. However, we want to mention here the choices of parameters we made 
to implement our algorithm. The justifications for them is based on intuitions coming from studying the 
equation (|MT|) . 

Scaling of the eigenvalues If all the entries of the data matrix are multiplied by a constant a, then the 
eigenvalues of £ p are multiplied by a 2 , and so are the eigenvalues of S p . Hence, if the eigenvalues of S p are 
divided by a factor a, Equation ()M-P|) remains valid if we change H QO (x) into H OCJ (ax). In practice, we scale 
the empirical eigenvalues by h the largest eigenvalue of S p . We solve our convex optimization problem 
with the scaled eigenvalues to obtain H^hx), from which we get H^x) through easy manipulations. 
The subsequent details describe how we solve our convex optimization problem, after rescaling of the 
eigenvalues. 

Choice of (zj,v(zj)) We have found that using 100 pairs (zj,v(zj)) was generally sufficient to obtain good 
and quick (10s-60s) results in simulations. More points is of course better. With 200 points, solving the 
problem took more time, but was still doable (40s-3mins). In the simulations and data analysis presented 
afterwards, we first chose the v(zj) and numerically found the corresponding Zj using Matlab's optimization 
toolbox. We took v(zj) to have a real part equally spaced (every .02) on [0, 1], and imaginary part of 10 -2 
or 10 -3 . In other words, our v(zjYs consisted of two (discretized) segments in C + , the second one being 
obtained from the first one by a vertical translation of 9 * 10~ 3 . 

Choice of interval to focus on The largest (resp. smallest) eigenvalue of a p x p symmetric matrix S 
are convex (resp. concave) functions of the entries of the matrix. This is because h(S) = sup|| u || 2=1 u'Su, 
where u is a vector in MP. Hence h(S) is the supremum of linear functionals of the entries of the matrix. 
Similarly, l p (S) = infi| u i| 2= i u'Su, so l p (S) is a concave function of the entries of S. Note that the sample 
covariance matrix S p is an unbiased estimator of X p . By Jensen's inequality, we therefore have E(h(S p )) > 
h(E(S p )) = Ai(Sp). In other words, h(S p ) is a biased estimator of Ai(£ p ), and tends to overestimate it. 
Similarly, l p (S p ) is a biased estimator of A p (£ p ) and tends to underestimate it. More detailed studies of 
h and l p indicate that they do not fluctuate too much around their mean. Practically, as n — ► oo, we 
will have with large probability, l p < X p and h > X\. (In certain cases, concentration bounds can make 
the previous statement rigorous.) Hence, after rescaling of the eigenvalues, it will be enough to focus on 
probability measures supported on the interval [l p /h, 1] when decomposing H^ilix). 

Choice of dictionary In the "smallest" implementation, we limit ourselves to a dictionary consisting of 
point masses on the interval [l p /h, 1], with equal spacing of .005. We call £ p the length of this interval. In 
larger implementations, we split the interval [l p /h, 1] into dyadic intervals, getting at scale k, 2 k intervals: 
[lp/h +j2~ k (p, lp/h + (j + l)2 _fc Cp]i for j = 0, . . . , 2 fc — 1. We store the end points of all the intervals at all 
the scales from k = 2 to k = 8 for the coarsest implementation and up to 10 for the finest We implemented 
dictionaries containing: 

1. Point masses every .005 on [l p /h,l], and probability measures supported on the dyadic intervals 
described above that have constant density on these intervals. 

2. Point masses every .005 on [l p /h,l\, and probability measures supported on the dyadic intervals 
described above that have constant density on these intervals, as well as probability measures on 
those dyadic intervals that have linearly increasing and linearly decreasing densities. 

The simulations presented above were made with this latter choice of dictionary using scales up to 8. 
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